11 Thinking about Data Collection
11.1 Learning Outcomes
By the end of this section, you should:
- understand some of the key questions you should ask about your data before beginning any kind of analysis
11.2 Introduction
Data collection is usually the first step in data analytics. It involves gathering raw data from sources such as databases, sensors, or user interactions, so that it can be processed and analysed later.
Often, the people who analyse the data are not the people who collected it. This places an extra burden on the analyst, who must make sure they fully understand the nature of the data and any issues or challenges it may present. This is especially true now that a wealth of sport data is available online.
11.3 Key Questions
Some questions that you, as an analyst, should ask at the very outset of a project are:
What is the source of the data set? Is it a reliable and credible source? If you’ve just downloaded it from the internet, how do you know that the data is of good quality?
How was the data collected? Was it through surveys, web scraping, automated data logging, or another method?
What is the time frame of the data? Does it represent a single snapshot, a continuous time series, or multiple time points?
Are there any known biases or limitations in the data collection process that could impact the analysis?
How has the data set been cleaned, pre-processed, or transformed prior to being provided for analysis?
Have any modifications been made to the raw data, such as data imputation or the creation of derived variables?
What are the primary and secondary objectives of the data collection? How do these objectives align with the goals of the analysis?
Are there any known issues or discrepancies in the data set, such as missing values, duplicate entries, or data entry errors?
What are the key variables and features in the data set? Are there any specific relationships between variables that need to be considered during the analysis?
How is the data structured? Is it in a flat-file, hierarchical, or relational format?
Are there any data privacy or security concerns related to the data set, such as personally identifiable information (PII) or sensitive data?
What units of measurement and scales are used in the data set? Are there any differences in the units or scales between variables that need to be addressed?
How were outliers or extreme values handled during the data collection and pre-processing phases?
Has the data set been used in previous analyses or studies? If so, what insights or conclusions were drawn from those analyses, and are there any lessons that can be applied to the current analysis?
Are there any external factors or events that could have influenced the data, such as economic changes, policy shifts, or technological advancements?
Answering these questions may require discussion with those responsible for the data collection, or careful reading of any accompanying documentation (for example, if the data set has been downloaded from an internet source).
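Some of the questions above can also be partly checked against the data itself, in particular those about missing values, duplicate entries, the time frame, and units. The following is a minimal sketch in Python using only the standard library. The sample data, the column names, the `audit` function, and the "plausible" ranges are all invented for illustration; in practice the ranges would come from your own knowledge of the sport and from the data's documentation.

```python
import csv
import io
from collections import Counter
from datetime import date

# Invented sample data for illustration only; not from any real source.
RAW_CSV = """player_id,session_date,distance_km,top_speed_kmh
P01,2023-08-01,9.8,31.2
P02,2023-08-01,10.4,
P03,2023-08-01,8.9,29.7
P01,2023-08-08,10.1,30.8
P02,2023-08-08,10400,32.0
P03,2023-08-01,8.9,29.7
"""


def audit(csv_text, date_column, numeric_limits):
    """Summarise basic data-quality facts about a CSV held in a string.

    numeric_limits maps a column name to a (low, high) range that the
    analyst considers plausible; values outside it are flagged, not removed.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Missing values: count empty cells per column.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if (value or "").strip() == "":
                missing[column] += 1

    # Duplicate entries: rows whose values all match an earlier row.
    seen = Counter(tuple(row.values()) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Time frame: earliest and latest date, plus number of distinct dates.
    dates = sorted({date.fromisoformat(row[date_column]) for row in rows})

    # Implausible values: may indicate a unit mix-up (e.g. metres vs km).
    flagged = []
    for index, row in enumerate(rows):
        for column, (low, high) in numeric_limits.items():
            cell = (row[column] or "").strip()
            if cell == "":
                continue  # already counted as missing
            value = float(cell)
            if not low <= value <= high:
                flagged.append((index, column, value))

    return {
        "n_rows": len(rows),
        "missing": dict(missing),
        "duplicate_rows": duplicates,
        "first_date": dates[0],
        "last_date": dates[-1],
        "n_dates": len(dates),
        "out_of_range": flagged,
    }


report = audit(
    RAW_CSV,
    date_column="session_date",
    numeric_limits={"distance_km": (0, 20), "top_speed_kmh": (0, 45)},
)
for key, value in report.items():
    print(f"{key}: {value}")
```

The sketch only flags suspicious values; it does not remove or correct them. A distance of 10400 in a column labelled kilometres might mean that one session was recorded in metres, but that is a question for whoever collected the data, not something the analyst should assume.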